Description of problem:
Shutting down the I/O-serving node takes 15-20 minutes for I/O to resume from the failed-over node.

Version-Release number of selected component (if applicable):
ganesha-2.3.1-4

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node cluster and configure ganesha on it.
2. Create a distributed-replicated 6x2 volume and mount it with vers=3 and vers=4 on two clients respectively.
3. Start creating I/O (100 KB files in my case) from both mount points.
4. Shut down the node that is serving the I/O.

Performed the above scenario 3 times; observations are as below:

1st attempt:
- With vers=4, I/O stopped and resumed after ~17 minutes.
- With vers=3, I/O continued without interruption.

2nd attempt:
- With vers=4, I/O stopped during the grace period and resumed after it.
- With vers=3, I/O stopped and resumed after ~15 minutes.

3rd attempt:
- With vers=4, I/O stopped and resumed after ~20 minutes.
- With vers=3, I/O continued without interruption.

Actual results:
Shutting down the I/O-serving node takes 15-20 minutes for I/O to resume from the failed-over node.

Expected results:
I/O should resume as soon as the grace period finishes.

Additional info:
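For reference, the two client mounts in step 2 can be set up with commands like the following. The VIP address, volume name, and mount points here are placeholders, not the ones used in the actual test:

```shell
# On client 1: NFSv3 mount of the ganesha export (VIP/volume/paths are illustrative)
mount -t nfs -o vers=3 <VIP>:/testvol /mnt/nfs3

# On client 2: NFSv4 mount of the same export
mount -t nfs -o vers=4 <VIP>:/testvol /mnt/nfs4
```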
This is most likely the same as bug 1278336.
With glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64, while rebooting the I/O-serving node, it takes around ~9 minutes for I/O to resume from the failover node in the case of an NFSv4 mount.

1. Create a 4-node cluster and configure ganesha on it.
2. Create a distributed-replicated 6x2 volume and mount it with vers=3 and vers=4 on two clients respectively.
3. Start creating I/O (100 KB files in my case) from both mount points.
4. Shut down the node that is serving the I/O.

Performed the above scenario 3 times; observations are as below:

1st attempt:
- With vers=4, I/O stopped and resumed after ~9 minutes.
- With vers=3, I/O stopped for around ~1 minute and resumed within the grace period itself.

2nd attempt:
- With vers=4, I/O stopped and resumed after ~9 minutes.
- With vers=3, I/O stopped for around ~1 minute and resumed within the grace period itself.

3rd attempt:
- With vers=4, I/O stopped and resumed after ~8 minutes.
- With vers=3, I/O stopped for around ~1 minute and resumed within the grace period itself.

I tried swapping the clients for NFSv3 and NFSv4; the observation was the same (with NFSv4, it takes around ~9 minutes to resume on both clients).

Expected result: I/O should resume as soon as the grace period finishes.
Soumya,

1. Tried mounting the volume on a single client with NFSv4: I/O resumed after ~2 minutes.
2. Tried setting a timeout while mounting the volume on a single client with NFSv4:
   mount -t nfs -o vers=4,timeo=200 10.70.44.154:/ganeshaVol1 /mnt/ganesha1/
   I/O resumed after ~2 minutes.
Based on Comment 9, since it takes around ~9 minutes for I/O to resume from the failover node in the case of an NFSv4 mount, reopening this bug.
Thanks for retesting, Manisha. Frank/Dan/Matt, do you have any comments with respect to the updates in comment #11 and comment #12?
Without some form of logs from the failover time, I'm not sure I can say anything.
Hi Soumya, I have edited the doc text for the release notes. Can you please take a look at it and let me know if anything more is needed?
Hi Bhavana,

This bug was FAILED_QA as there was one outstanding issue. I have changed the doc_text to reflect that. Please check the same.

<<<<
When a volume is accessed by heterogeneous clients (i.e., both NFSv3 and NFSv4 clients), it was observed that NFSv4 clients take a longer time to recover after the virtual-IP failover caused by a node shutdown.

Workaround: Use different VIPs for the different access protocols (i.e., NFSv3 or NFSv4).
>>>>
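The workaround in the doc text above can be illustrated as follows. The two VIP addresses, the volume name, and the mount points are hypothetical; the point is only that NFSv3 clients and NFSv4 clients each mount through a VIP dedicated to that protocol:

```shell
# Hypothetical per-protocol VIPs configured in the ganesha-ha cluster.

# NFSv3 clients mount only through the v3-dedicated VIP:
mount -t nfs -o vers=3 10.70.44.201:/ganeshaVol1 /mnt/v3

# NFSv4 clients mount only through the v4-dedicated VIP:
mount -t nfs -o vers=4 10.70.44.202:/ganeshaVol1 /mnt/v4
```

With this split, a failover of the v4 VIP does not stall the v3 clients (and vice versa), which avoids the mixed-client recovery delay described above.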
Thanks Soumya. Added the doc text for the release notes.